Instructions

  1. All data sets used in the case studies are at: https://github.com/yichenqin/dataviz/

  2. For each question in each case study, your answer should include: R code, R output, and interpretation of the R output. Below is an example.

              Q1: What is the number of observations?
              Answer: I use dim() to find out the number of observations.
              d = read.csv("college.csv")
              dim(d)[1]
              ## [1] 1269
              So there are 1269 observations.
  3. You can use either Microsoft Word or R Markdown to prepare your report. To use Microsoft Word, copy and paste the R code and output from RStudio into Word.

  4. Please submit your report (in doc/html) to Canvas before the deadline.

  5. Note that 20% of the total grade depends on how closely your answers follow the visualization principles and the requirement checklist.

Case 1 (See Canvas for due date)

Questions

  1. (5 points) Create a working directory, and download all three data files from Canvas (CVG_Flights.csv, airlines.csv, and airports.csv) to the directory. Read these files into R as three data frames. Here is a code sample/template:
     flights = read.csv("CVG_Flights.csv", header = TRUE, na.strings = "")
     Note that na.strings = "" converts blank entries ("") to NA.

  2. (10 points) How many rows and columns are there in each data frame? What does each row represent (a plane, an airport, a flight, or an airline company)? What does each column represent? Explain the meanings of the variables to the best of your understanding.

  3. (5 points) Merge all three data frames into one data frame according to the IATA codes of airlines and airports. For airports.csv, merge it on both the origin and the destination airports in CVG_Flights.csv, which means you need to merge twice. Here is a code sample/template (left_join() comes from the dplyr package):
     merged_data <- left_join(flights_data, airlines_data, by=c("AIRLINE"="IATA_CODE"))

  4. (5 points) For this merged data set, print the first six rows.

  5. (5 points) For this merged data set, are there any missing values? In which variables do these missing values occur? What is the percentage of missing values for each variable (i.e., the number of missing values divided by the total number of observations)?

  6. (10 points) What is the proportion of canceled flights (to all flights)? How many different cancellation reasons are there?

  7. (10 points) For DEPARTURE_TIME, are there missing values? Do we know why these values are missing? Hint: canceled flights?

  8. (10 points) In the merged data frame, create a new variable (i.e., new column) as the time difference between the SCHEDULED_TIME and the ELAPSED_TIME, i.e., SCHEDULED_TIME - ELAPSED_TIME. Print the first six elements of the new variable.

  9. (10 points) Extract the observations (i.e., rows) where AIRLINE is Delta, ORIGIN_AIRPORT is Cincinnati/Northern Kentucky International Airport, and DEPARTURE_DELAY is greater than 30 minutes, and put these observations into a new data frame. Print the first six flight numbers of the new data frame.

  10. (10 points) Use group_by() and summarize() to compute the average departure delay time for each airline. Which airlines have the longest and the shortest average departure delay?

  11. (10 points) Use group_by() and summarize() to compute the average departure delay time for each ORIGIN_AIRPORT. Sort these airports in descending order of average departure delay time and print the top six rows, i.e., the top six airports and their average delay times. Which ORIGIN_AIRPORT has the longest and which has the shortest average departure delay?

  12. (10 points) For flights departing from CVG airport, count how many flights are offered by each airline. Print the entire list.
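The merge, missing-value, filtering, and grouping steps in the questions above can be sketched on toy data as follows. This is a minimal sketch, assuming the dplyr package is installed. The three small data frames and all of their values are made up for illustration, and the column names (AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT, DEPARTURE_DELAY, IATA_CODE, AIRPORT) are only assumed to mirror the real files, so check them with names() before reusing any of it. AIRLINE_NAME and the "_ORIGIN"/"_DEST" suffixes are hypothetical names chosen here.

```r
library(dplyr)

# Toy stand-ins for the three CSV files (all values are made up).
flights <- data.frame(
  AIRLINE = c("DL", "DL", "AA", "AA"),
  ORIGIN_AIRPORT = c("CVG", "ATL", "CVG", "CVG"),
  DESTINATION_AIRPORT = c("ATL", "CVG", "DFW", "ORD"),
  DEPARTURE_DELAY = c(45, 5, NA, 10)
)
airlines <- data.frame(IATA_CODE = c("DL", "AA"),
                       AIRLINE = c("Delta", "American"))
airports <- data.frame(IATA_CODE = c("CVG", "ATL", "DFW", "ORD"),
                       AIRPORT = c("Cincinnati", "Atlanta", "Dallas", "Chicago"))

# Rename the airline name column so it does not clash with the code column
# in flights (AIRLINE_NAME is a hypothetical name).
airlines <- rename(airlines, AIRLINE_NAME = AIRLINE)

# Merge once for airlines and twice for airports (origin, then destination).
# The suffix argument disambiguates the two AIRPORT name columns.
merged <- flights %>%
  left_join(airlines, by = c("AIRLINE" = "IATA_CODE")) %>%
  left_join(airports, by = c("ORIGIN_AIRPORT" = "IATA_CODE")) %>%
  left_join(airports, by = c("DESTINATION_AIRPORT" = "IATA_CODE"),
            suffix = c("_ORIGIN", "_DEST"))

# Percentage of missing values per column.
pct_missing <- colMeans(is.na(merged)) * 100

# Rows matching several conditions at once (rows with NA delay are dropped).
late_delta <- filter(merged, AIRLINE == "DL", ORIGIN_AIRPORT == "CVG",
                     DEPARTURE_DELAY > 30)

# Average departure delay per airline, longest first.
avg_delay <- merged %>%
  group_by(AIRLINE_NAME) %>%
  summarize(avg_dep_delay = mean(DEPARTURE_DELAY, na.rm = TRUE)) %>%
  arrange(desc(avg_dep_delay))

print(pct_missing)
print(avg_delay)
```

A left join keeps every flight even when a code has no match in the lookup table, which is why the number of rows in merged should equal the number of rows in flights. Comparing nrow() before and after each join is a cheap sanity check.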

Case 2 (See Canvas for due date)

In this homework, we will learn from a pioneer in data visualization, Hans Rosling, and try to recreate one of his visualizations. Watch Hans Rosling’s presentation and take a look at his gapminder website, which shows the same visualization with higher-resolution images.

Questions

1 (100 points). Please replicate Hans Rosling’s visualization as closely as possible using ggplot. You only need to select one year to replicate. Try your best to replicate the symbol colors, shapes, sizes, axes, ticks, labels, text, grids, background colors, background text, etc. Of course, it is impossible to replicate everything exactly. The visualization below should be your target.

A slightly different version of the data in this visualization is available in the R package gapminder. You can install the package by using install.packages("gapminder") and load the data using data(gapminder). Below is an acceptable example of the replication based on the gapminder package.

Useful ggplot Functions
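As a starting point, the following minimal sketch shows a few ggplot2 functions that are commonly used for this kind of bubble chart. It assumes the ggplot2 and gapminder packages are installed. The year (2007), the axis breaks, the label position, and the theme are arbitrary illustrative choices, not the target values, and the result is far from a finished replication.

```r
library(ggplot2)
library(gapminder)

# One year of data; 2007 is an arbitrary choice among the years in the package.
d <- subset(gapminder, year == 2007)

p <- ggplot(d, aes(x = gdpPercap, y = lifeExp)) +
  # Large, faint year label drawn first so that it sits behind the points.
  annotate("text", x = 4000, y = 55, label = "2007",
           size = 40, colour = "grey85") +
  # Shape 21 is a circle with separate fill and outline colours.
  geom_point(aes(size = pop, fill = continent),
             shape = 21, colour = "black", alpha = 0.7) +
  # Log-scaled income axis with doubling tick marks.
  scale_x_log10(breaks = c(500, 1000, 2000, 4000, 8000, 16000, 32000, 64000)) +
  # Make point area (not radius) proportional to population.
  scale_size_area(max_size = 20, guide = "none") +
  labs(x = "Income (GDP per capita)", y = "Life expectancy (years)",
       fill = "Continent") +
  theme_minimal()

print(p)
```

Other functions worth looking up in the ggplot2 documentation include scale_fill_manual() for custom colors and theme() for fine control over grids, text, and backgrounds.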

Grading Instructions

The grade depends on:

  1. (40 points) Replication of the symbol size (5 points), color (5 points), transparency (5 points), shape (5 points), layout (10 points), correct choice of the variable (10 points).
  2. (30 points) Replication of the axes, such as transformation (5 points), tick marks (10 points), text (5 points), labels (10 points), ranges (10 points).
  3. (20 points) Replication of the background, such as text (5 points), grid (10 points), color (5 points).
  4. (10 points) Anything else that is in the visualization that you feel you can replicate.

Case 3 (See Canvas for due date)

For this case study, we will analyze a data set on college admission in college.csv. In the data set, each row represents one university and each column represents one variable. The continuous variables are admission_rate, sat_avg, undergrads, tuition, faculty_salary_avg, loan_default_rate, median_debt, lon, and lat. The categorical variables are name, city, state, region, highest_degree, control, and gender.

Questions

1 (5 points). For college.csv, how many variables and how many observations are there in the data? Are there missing values in the data?

2 (10 points). Pick one continuous variable and visualize its distribution. Pick one categorical variable and visualize its distribution. For continuous variables, you can choose from a histogram, density plot, violin plot, and many others. For categorical variables, you can choose from a bar plot and many others. Describe what you observe in the visualization.

3 (15 points). Pick three pairs of variables (i.e., continuous vs continuous, continuous vs categorical, and categorical vs categorical). For each pair of variables, visualize the association between them. Make sure there are some meaningful patterns in your visualization. Describe what you observe in the visualization.

4 (10 points). Visualize the association between a pair of variables (of your choice) conditional on a third variable (of your choice). Make sure there are some meaningful patterns in your visualization. Describe what you observe in the visualization. This is similar to the previous questions but your visualization involves three variables. You can often use color, shape, size, facet to represent the third variable.

5 (10 points). Visualize the association/interaction among four variables (of your choice). Make sure there are some meaningful patterns in your visualization. Describe what you observe in the visualization. This is similar to the previous questions but your visualization involves four variables. You can often use color, shape, size, facet to represent the third and fourth variables.

6 (50 points). Propose two questions you are interested in about this data set, and then answer these questions using visualization. Some example questions can be:

Please propose your own questions and do not use exactly the same questions listed above. Note that this question is similar to the questions for the final project.
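The idea of representing a third and a fourth variable with color and facets, mentioned in questions 4 and 5, can be sketched as follows. This is a minimal sketch on simulated data, assuming the ggplot2 package is installed. The data frame is made up: it borrows only column names from the variable list above, and the category labels ("Public", "Private", "Midwest", "South") and the simulated relationship are hypothetical.

```r
library(ggplot2)

set.seed(1)
n <- 200

# Simulated stand-in for college.csv (values are not real).
toy <- data.frame(
  sat_avg = round(rnorm(n, mean = 1050, sd = 120)),
  control = sample(c("Public", "Private"), n, replace = TRUE),
  region  = sample(c("Midwest", "South"), n, replace = TRUE)
)
# Simulated negative association, clipped to the valid range [0, 1].
toy$admission_rate <- pmin(pmax(1.6 - 0.001 * toy$sat_avg +
                                  rnorm(n, sd = 0.1), 0), 1)

# One continuous variable: histogram.
p_hist <- ggplot(toy, aes(x = sat_avg)) +
  geom_histogram(bins = 30)

# One categorical variable: bar plot of counts.
p_bar <- ggplot(toy, aes(x = control)) +
  geom_bar()

# Four variables at once: x, y, color (third), and facet (fourth).
p_four <- ggplot(toy, aes(x = sat_avg, y = admission_rate, colour = control)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ region) +
  labs(x = "Average SAT", y = "Admission rate", colour = "Control")

print(p_four)
```

Color tends to work best for a categorical variable with few levels, while facets keep each panel readable when points overlap heavily.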

Case 4 (See Canvas for due date)

For this case study, we will analyze a data set, country_stat.csv, on different countries’ GDP, population, life expectancy, infant mortality rate, fertility rate, continent, and region, measured over a number of years. Each row represents one country in a particular year. Continuous variables include year, GDP, population, life expectancy, infant mortality rate, and fertility rate. Discrete variables include continent and region.

Questions

  1. (10 points) Are there missing values in the data? If so, can you show how the data are missing? (open-ended question)

  2. (5 points) How many unique countries are included in the data? How many years of observations are included in the data?

  3. (5 points) In the data, create a new variable called GDP_per_capita, equal to GDP/population.

  4. (80 points) Propose four questions you would like to answer about this data (20 points per question). At least one question needs to be related to time series and be answered using time series data visualization. Some example questions: Do developing countries grow more slowly than developed countries? Is Africa catching up with the world or being left behind? Is the world more divided now than it was 50 years ago? Please propose your own questions and do not use exactly the same questions listed above.
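The missing-value summary, the new variable, and a simple time series plot from the questions above can be sketched as follows. This is a minimal sketch on toy data, assuming the dplyr and ggplot2 packages are installed. All values are made up, and the column names country, continent, year, GDP, and population are assumptions about country_stat.csv that should be checked against the real file. Only GDP_per_capita is named in the assignment.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for country_stat.csv (values are made up).
toy <- data.frame(
  country    = rep(c("A", "B", "C"), each = 3),
  continent  = rep(c("Africa", "Africa", "Europe"), each = 3),
  year       = rep(c(1990, 2000, 2010), times = 3),
  GDP        = c(10, NA, 30, 20, 40, 60, 100, 150, 200),
  population = c(5, 6, 7, 10, 10, 10, 20, 20, 20)
)

# One way to show how data are missing: share of missing GDP values per year.
missing_by_year <- toy %>%
  group_by(year) %>%
  summarize(pct_missing_GDP = 100 * mean(is.na(GDP)))

# Number of distinct countries and years.
n_countries <- n_distinct(toy$country)
n_years <- n_distinct(toy$year)

# New variable; NA in GDP propagates to GDP_per_capita.
toy <- mutate(toy, GDP_per_capita = GDP / population)

# Time series: median GDP per capita by continent and year.
trend <- toy %>%
  group_by(continent, year) %>%
  summarize(median_gdp_pc = median(GDP_per_capita, na.rm = TRUE),
            .groups = "drop")

p <- ggplot(trend, aes(x = year, y = median_gdp_pc, colour = continent)) +
  geom_line() +
  geom_point()

print(missing_by_year)
print(p)
```

Summarizing missingness by year (or by country) often reveals whether values are missing at random or concentrated in particular periods.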

Case 5 (See Canvas for due date)

For this case study, you will analyze three data sets on CVG flights (CVG_Flights.csv, airlines.csv, and airports.csv) and prepare a report summarizing your results. In CVG_Flights.csv, each row represents one flight either to or from CVG between January and March 2015. Columns/variables include flight information such as date, flight number, delay time, origin and destination airports, departure time, air time, distance, and cancellation. In airlines.csv, each row represents one airline company. Columns/variables include airline names. In airports.csv, each row represents one airport. Columns/variables include airport name, city, state, longitude, and latitude.

Questions

  1. For CVG_Flights.csv, how many variables and how many observations are there in the data? Are there missing values in the data? If so, can you show how the data are missing?

  2. For each variable in the data set, please describe what you observe, such as summary statistics, distributions, etc.

  3. Visualize the association between two variables of your choice. Check to see if there is an interesting relationship worth mentioning. If so, you can explore further and visualize what you have found.

  4. Visualize the association between some variable pairs (of your choice) conditional on some other variables (of your choice). This is similar to the previous questions but your visualization involves more than two variables.

  5. Merge all three data sets (CVG_Flights.csv, airlines.csv, and airports.csv) according to the IATA codes for airlines and airports. This is the same as one of the questions in case study 1.

  6. Based on the merged data set (i.e., CVG_Flights.csv, airlines.csv, and airports.csv merged by the airline and airport IATA codes), propose four questions you would like to answer about this data. Then answer these questions using visualization. One of these questions needs to be related to time series data visualization. Another needs to be related to spatial data visualization. Some example questions: Does CVG offer more flights to the east coast than to the west coast? Which regions of the USA experience more flight delays? How are the airports distributed across the USA? Do the average delay time or cancellation rate change from week to week? Please propose your own questions and do not use exactly the same questions listed above.
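The time series and spatial requirements in question 6 can be approached along the following lines. This is a minimal sketch on toy data, assuming the dplyr and ggplot2 packages are installed. All values are made up (the airport coordinates are only approximate), and the column names FLIGHT_DATE, DEPARTURE_DELAY, CANCELLED, LONGITUDE, and LATITUDE are assumptions about the real files. In particular, the real data may store the date in a different form, in which case it would first need to be converted with as.Date().

```r
library(dplyr)
library(ggplot2)

# Toy flights: two full weeks starting on Monday 2015-01-05 (values made up).
toy <- data.frame(
  FLIGHT_DATE     = as.Date("2015-01-05") + 0:13,
  DEPARTURE_DELAY = c(5, 10, 0, 20, 15, 30, 25, 2, 4, 6, 8, 10, 12, 14),
  CANCELLED       = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
)

# Aggregate to weekly averages; cut() labels each date by its week's Monday.
weekly <- toy %>%
  mutate(week_start = as.Date(cut(FLIGHT_DATE, "week"))) %>%
  group_by(week_start) %>%
  summarize(avg_delay = mean(DEPARTURE_DELAY, na.rm = TRUE),
            cancel_rate = mean(CANCELLED),
            .groups = "drop")

# Time series plot of the weekly average delay.
p_ts <- ggplot(weekly, aes(x = week_start, y = avg_delay)) +
  geom_line() +
  geom_point()

# Toy airports with approximate coordinates, plotted as a simple point map.
airports_toy <- data.frame(
  IATA_CODE = c("CVG", "ATL", "DFW"),
  LONGITUDE = c(-84.66, -84.43, -97.04),
  LATITUDE  = c(39.05, 33.64, 32.90)
)
p_map <- ggplot(airports_toy, aes(x = LONGITUDE, y = LATITUDE, label = IATA_CODE)) +
  geom_point() +
  geom_text(vjust = -1) +
  coord_quickmap()  # keeps the aspect ratio roughly map-like

print(weekly)
```

Aggregating by week smooths out day-of-week effects, which can otherwise dominate a daily series. A state or country outline would make the point map easier to read, but drawing one requires additional map data.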

Case 6 (See Canvas for due date)

In this homework, we will first learn from a pioneer and visionary in data visualization, Florence Nightingale, and try to replicate and improve upon one of her visualizations, the rose chart. Nightingale revolutionized nursing and was also a mathematician who knew the power of a visual representation of information. Below is Nightingale’s rose chart. Please see the link here for a detailed description of the background.

Questions

1 (10 points). What are the strengths of Nightingale’s rose chart? What do you like about it? Note that it was created before modern technology became available. In addition, what are some of its weaknesses?

2 (30 points). Please replicate this visualization as closely as possible using R. Obviously, it is difficult to replicate every single detail in the visualization. You should try to replicate as much as you can. The data for this visualization can be downloaded at https://github.com/yichenqin/dataviz/blob/main/data/Nightingale.RData (click “Download” to download the RData file); use load("Nightingale.RData") to load the data into R. You can use the rate variables, i.e., Disease.rate, Wounds.rate, and Other.rate. Here are some useful ggplot functions

3 (60 points). Try to improve upon this visualization, based on the weaknesses you identified, by creating a new visualization. It can be of a different visualization type.
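One common way to draw a rose-style chart in ggplot2 is to build an ordinary bar chart and then switch to polar coordinates. The following is a minimal sketch of that idea, assuming the ggplot2 package is installed. The toy data frame and its values are made up. Only the column name Disease.rate comes from the assignment, and Month is a hypothetical column. The square-root scale is an illustrative choice intended to make wedge area, rather than radius, roughly track the rate. It is not necessarily how the original chart was constructed.

```r
library(ggplot2)

# Toy data for six months (values are made up).
toy <- data.frame(
  Month = factor(month.abb[1:6], levels = month.abb[1:6]),
  Disease.rate = c(10, 40, 90, 160, 250, 120)
)

p <- ggplot(toy, aes(x = Month, y = Disease.rate)) +
  # width = 1 makes neighbouring wedges touch, as in a rose chart.
  geom_col(width = 1, fill = "steelblue", colour = "black", alpha = 0.6) +
  # Square-root radius so that wedge area grows roughly linearly with the rate.
  scale_y_sqrt() +
  # Wrap the bar chart around a circle; start = 0 begins at 12 o'clock.
  coord_polar(start = 0) +
  labs(x = NULL, y = NULL, title = "Toy rose chart (made-up rates)")

print(p)
```

To overlay several causes as in the original chart, one option is to reshape the three rate columns into long format and map the cause to fill with position = "identity", so that the wedges are drawn on top of each other rather than stacked.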

Submission and Grading Instructions

The grade of question 2 depends on

The grade of question 3 depends on